
(2017) Dual-Path Convolutional Image-Text Embedding

Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embedding with instance loss[J]. arXiv preprint arXiv:1711.05535, 2017.



1. Overview


1.1. Motivation

  • existing methods use an RNN for text feature learning and an off-the-shelf CNN for image feature extraction
  • ImageNet pre-trained models do not preserve the rich image details that are critical for matching language


This paper proposes

  • CNN for fine-tuning the visual and textual representation (network only contains Conv, Pooling, ReLU, BN)
  • instance loss according to viewing each multimodal data pair as a class


1.2. Contribution

  • dual-path CNN model
  • instance loss
  • outperforms prior methods on Flickr30K, MSCOCO, and CUHK-PEDES

1.3. Related Work

1.3.1. Model for Image Recognition

  • fixed CNN feature as input

1.3.2. Model for Natural Language Understanding

  • word2vec
  • RNN, bidirectional LSTM
  • CNNs used for machine translation, with a 9.3x speed-up

1.3.3. Multi-modal Learning

  • class-level retrieval. leverages the class labels in the training set
  • instance-level retrieval. matches image-text pairs without using any class labels

This paper focuses on instance-level retrieval and proposes the instance loss.

1.4. Dataset

  • Flickr30k. 5 sentences per image, avg 10.5 words per sentence after rare-word removal
  • MSCOCO. avg 5 sentences per image, avg 8.7 words after rare-word removal
  • CUHK-PEDES. 2 sentences per image, avg 19.6 words after rare-word removal



2. Method





2.1. Deep Image CNN

  • pre-trained models can still provide a good CNN initialization
  • input. 224x224
  • output. 2048 dimension vector


2.2. Deep Text CNN

2.2.1. Text Processing

  • convert the sentence to a one-hot code T [n x d]


  1. n. length of the sentence (set to a fixed length)
  2. d. size of the dictionary
  • use word2vec to filter out rare words
  • pad T with zeros if the sentence has fewer than the fixed number of words
  • reshape T to 1 x 32 x d (h, w, c)
  • position shift (more robust). pad a random number of zeros at the beginning and the end of the sentence
    • left alignment. pads only at the end of the sentence
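The encoding steps above can be sketched as follows. This is a toy NumPy illustration, not the paper's code; `encode_sentence` and `vocab` are hypothetical names I introduce here.

```python
import numpy as np

def encode_sentence(words, vocab, n=32, shift=True, rng=None):
    """Encode a tokenized sentence as a 1 x n x d one-hot tensor T.

    Hypothetical helper (not from the paper): `vocab` maps each kept word
    to its dictionary index; out-of-vocabulary (rare) words are dropped.
    """
    d = len(vocab)
    rng = rng or np.random.default_rng()
    idx = [vocab[w] for w in words if w in vocab][:n]
    T = np.zeros((n, d))
    # position shift: start the sentence at a random offset instead of
    # always left-aligning it, so zero-padding ends up on both sides
    offset = int(rng.integers(0, n - len(idx) + 1)) if shift else 0
    for i, j in enumerate(idx):
        T[offset + i, j] = 1.0
    return T.reshape(1, n, d)   # (h=1, w=n, c=d), as fed to the text CNN
```

With `shift=False` this reduces to left alignment; with `shift=True` the random offset implements position shift.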

2.2.2. Deep Text CNN

  • input. T (1 x 32 x d)
  • output. 2048 dimension vector
  • the filter size of the first conv is 1 x 1 x d x 300; two methods to initialize it:
    • random initialization
    • using d x 300 matrix from word2vec for initialization (better)
  • conv kernel size is 1x2: every two neighbouring components may form a phrase containing content information
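A small sketch of why word2vec initialization of the first conv makes sense (toy NumPy example with a 5-word dictionary; the real d and word2vec matrix are assumptions here):

```python
import numpy as np

# Applying a 1 x 1 x d x 300 convolution to one-hot word vectors is just a
# row lookup in a d x 300 matrix, so initializing these filters with the
# word2vec matrix turns the first layer into a trainable word embedding.
d, emb_dim = 5, 300                     # toy dictionary size; 300-dim vectors
rng = np.random.default_rng(0)
W = rng.standard_normal((d, emb_dim))   # stand-in for the word2vec matrix
T = np.eye(d)[[2, 0, 4]]                # three one-hot words (n=3, d=5)
features = T @ W                        # same result as the 1x1 convolution
```

Each row of `features` is exactly the embedding of the corresponding word, which is why the word2vec-initialized variant starts from a meaningful representation instead of random noise.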


2.3. Loss Function

2.3.1. Ranking Loss



  • I. visual input
  • T. text input
  • I_a/T_a. the same image/text group
  • I_n/T_n. negative sample
  • α. margin

  • the convergence of ranking loss requires both image and text branches converge

  • may be prone to getting stuck in a local minimum
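The ranking-loss equation itself did not survive here. With the notation above, it is presumably the standard bidirectional triplet formulation (my reconstruction; S(·,·) denotes the similarity between the embedded inputs):

```latex
L_{rank} = \max\big(0,\ \alpha - S(I_a, T_a) + S(I_a, T_n)\big)
         + \max\big(0,\ \alpha - S(T_a, I_a) + S(T_a, I_n)\big)
```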

2.3.2. Instance Loss

  • assumption. each image/text group is distinct
  • the instance loss is a softmax loss that classifies each image/text group into one of a large number of classes


  • L. loss
  • P. probability
  • P(c). predicted probability of the correct class c
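The loss equation is missing here; given the definitions above, it is the usual softmax cross-entropy over instance classes (my reconstruction, with f the 2048-dim image or text feature and W the shared classifier weights):

```latex
P = \mathrm{softmax}\big(W^{\top} f\big), \qquad L_{instance} = -\log P(c)
```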

Shared weights are enforced in the fully connected layer for the two modalities; otherwise the learned image and text features may exist in totally different subspaces.
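The weight-sharing idea can be sketched like this (toy NumPy example, not the paper's code; the dimensions and helper names are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def instance_loss(feat, W, c):
    """Softmax loss over instance classes; `c` is the index of the
    ground-truth image/text pair (toy sketch, not the paper's code)."""
    return -np.log(softmax(feat @ W)[c])

rng = np.random.default_rng(0)
num_pairs, dim = 10, 2048                 # 10 training pairs -> 10 "classes"
W = rng.standard_normal((dim, num_pairs)) * 0.01   # shared classifier weights
img_feat = rng.standard_normal(dim)
txt_feat = rng.standard_normal(dim)
# the SAME W scores both modalities, which pulls the image and text
# features toward one shared subspace
loss = instance_loss(img_feat, W, c=3) + instance_loss(txt_feat, W, c=3)
```

If each branch had its own classifier, each could satisfy its softmax loss in an arbitrary subspace; the shared W is what couples them.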

2.3.3. Total Loss
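The total-loss equation is missing here. From the stage descriptions below (λ_1 weights the ranking loss and is set to 0 in stage 1), it presumably takes the form below; the λ_2 and λ_3 weights on the per-branch instance losses are my assumption:

```latex
L_{total} = \lambda_1 L_{rank} + \lambda_2 L_{instance}^{visual} + \lambda_3 L_{instance}^{textual}
```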



  • stage-1. only instance loss, λ_1 = 0

    • so that the ranking loss can find a better optimum for both modalities in the second stage
    • using the instance loss alone can lead to a competitive result
    • the instance loss encourages the model to find fine-grained differences, such as a ball or a stick


  • stage-2. ranking loss + instance loss

2.3.4. Training Stage

  • stage-1. fix the pre-trained image CNN and use the instance loss to tune the remaining parts
    • if the image and text CNNs were trained simultaneously, the text CNN might compromise the pre-trained image CNN
  • stage-2. instance loss + ranking loss to fine-tune the entire network



3. Experiments


3.1. Metric

  • Recall@K. percentage of queries whose true match appears in the top K of the ranked list
  • Median Rank. median rank of the closest ground-truth match in the ranked list
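The two metrics can be computed as below (toy NumPy sketch; the convention that query i's true match is gallery item i is mine, not the datasets'):

```python
import numpy as np

def match_ranks(sim):
    """1-based rank of each query's true match. sim[i, j] is the similarity
    of query i to gallery item j; item i is assumed to be the match
    (toy convention, not the datasets' actual indexing)."""
    order = np.argsort(-sim, axis=1)            # best match first
    return np.array([int(np.where(order[i] == i)[0][0]) + 1
                     for i in range(sim.shape[0])])

def recall_at_k(sim, k):
    # fraction of queries whose true match lands in the top K
    return float(np.mean(match_ranks(sim) <= k))

def median_rank(sim):
    return float(np.median(match_ranks(sim)))

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.1, 0.8],
                [0.0, 0.7, 0.3]])
# true matches are ranked 1st, 3rd and 2nd respectively
```

With this toy `sim`, Recall@1 = 1/3, Recall@2 = 2/3 and the Median Rank is 2.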

3.2. Details

  • SGD + fixed 0.9 momentum
  • Matconvnet Framework
  • 224x224 random crop after resizing the shorter image side to 256
  • horizontal flipping for image
  • position shift for text
  • 0.75 dropout
  • max text length 32 for Flickr30K and MSCOCO, 56 for CUHK-PEDES
  • LR. 0.001
  • α=1

3.3. Comparison



3.4. Ablation Study

3.4.1. Loss



  • stage-1. the ranking loss focuses on inter-modal distance; it may be hard to tune the visual and textual features simultaneously at the beginning
  • stage-1. the instance loss performs better, since it focuses more on learning intra-modal discriminative descriptors
  • the instance loss also helps regularise the model

3.4.2. Fine-tune



  • fine-tune in stage-2 helps

3.4.3. Initialization



  • word2vec initialization helps

3.4.4. Position Shift vs Left Alignment



3.5. Training Time

  • image CNN ~119ms per image batch (32) on 1080Ti
  • text CNN ~117ms per sentence batch (32)

Since the image and text features can be computed simultaneously, the model can run efficiently in parallel.